Model Selection

STA 101 - Summer I 2022

Raphael Morsomme

Welcome

Announcements

  • assignments

Recap

Competing Models

  • Earlier, we saw that adding the predictor \(\dfrac{1}{\text{displ}}\) gave a better fit.

  • Let us see if the same idea works with the trees dataset.

A nonlinear association?

Suppose we want to predict Volume using only the variable Girth.

d_tree <- datasets::trees
ggplot(d_tree) +
  geom_point(aes(Girth, Volume))

One could argue that there is a slight nonlinear association.

R function to compute \(R^2\)

library(broom) # glance() comes from the broom package

compute_R2 <- function(model){
  
  model_glanced <- glance(model)
  R2 <- model_glanced$r.squared
  R2_rounded <- round(R2, 3) 
  
  return(R2_rounded)
  
}

Starting simple…

We start with the simple model

\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} \]

m1 <- lm(Volume ~ Girth, data = d_tree)
compute_R2(m1)
[1] 0.935

…taking it up a notch…

To capture the nonlinear association between Girth and Volume, we consider the predictor \(\text{Girth}^2\).

\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 \]

The R code to fit this model is

d_tree2 <- mutate(d_tree, Girth2 = Girth^2)
m2 <- lm(Volume ~ Girth + Girth2, data = d_tree2)
compute_R2(m2)
[1] 0.962

\(R^2\) has increased! It went from 0.935 to 0.962.

…before taking things to the extreme

What if we also include the predictors \(\text{Girth}^3, \text{Girth}^4, \dots, \text{Girth}^{k}\) for some larger number \(k\)?

The following R code fits such a model with \(k=30\), that is, \[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 + \dots + \beta_{30} \text{girth}^{30} \]

m_extreme <- lm(Volume ~ poly(Girth, degree = 30, raw = TRUE), data = d_tree)
compute_R2(m_extreme)
[1] 0.975

\(R^2\) has again increased! It went from 0.962 to 0.975.

Overfitting

In fact, if we keep adding predictors, \(R^2\) will always increase.

  • Additional predictors make the regression line more flexible, allowing it to pass closer to the points and thereby reduce the residuals.
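This claim can be checked directly: fitting polynomials of increasing degree to the trees data, the in-sample \(R^2\) never decreases. A minimal base-R sketch:

```r
# Sketch: R^2 can only increase as we add polynomial terms (trees data)
d_tree <- datasets::trees

R2 <- sapply(1:8, function(k) {
  m <- lm(Volume ~ poly(Girth, k), data = d_tree)  # degree-k polynomial
  summary(m)$r.squared
})
round(R2, 3)  # nondecreasing: each extra term can only reduce the residuals
```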

Is the m_extreme model a good model?

  • Does it accurately represent the population?
  • Remember that we want to learn about the relation between Volume and Girth present in the population (not the sample).

Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.01)
new_data <- tibble(Girth = Girth_new)
Volume_pred <- predict(m_extreme, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 2
  Girth Volume_pred
  <dbl>       <dbl>
1  8.3         10.2
2  8.31        10.3
3  8.32        10.4
4  8.33        10.5
5  8.34        10.6
6  8.35        10.7
ggplot() +
  geom_point(data = d_tree  , aes(Girth, Volume     )) +
  geom_line (data = new_data, aes(Girth, Volume_pred))

ggplot() +
  geom_point(data = d_tree  , aes(Girth, Volume     )) +
  geom_line (data = new_data, aes(Girth, Volume_pred)) +
  ylim(10, 77)

The model m_extreme overfits the data.

A model overfits data when it corresponds very closely to the data set and does a poor job for new data.

Remember that we want to learn about the population, not the sample!

Model selection: criteria

Adjusted-\(R^2\)

We saw that \(R^2\) keeps increasing as we add predictors.

\[ R^2 = 1 - \dfrac{SSR}{SST} \]

\(R^2\) can therefore not be used to identify models that overfit the data.

Instead, we use the adjusted-\(R^2\).

\[ R^2_{\text{adj}} = 1 - \dfrac{SSR}{SST}\dfrac{n-1}{n-p-1} \]

where \(p\) corresponds to the number of predictors.

The ratio \(\dfrac{SSR}{SST}\) favors models that closely fit the data.

The ratio \(\dfrac{n-1}{n-p-1}\) penalizes models with many predictors.

The model with the highest adjusted-\(R^2\) typically hits the sweet spot.
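As a sanity check, the adjusted-\(R^2\) formula above can be computed by hand and compared with the value that lm() reports (a sketch using the simple Girth model, re-fit here so the snippet is self-contained):

```r
# Sketch: computing the adjusted R^2 by hand and checking it against
# the value reported by lm()'s summary (trees data)
d_tree <- datasets::trees
m <- lm(Volume ~ Girth, data = d_tree)

n   <- nrow(d_tree)                                 # 31 observations
p   <- 1                                            # one predictor
SSR <- sum(residuals(m)^2)                          # sum of squared residuals
SST <- sum((d_tree$Volume - mean(d_tree$Volume))^2) # total sum of squares

R2_adj <- 1 - (SSR / SST) * (n - 1) / (n - p - 1)
all.equal(R2_adj, summary(m)$adj.r.squared)         # TRUE
```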

AIC and BIC

Two popular criteria that balance goodness of fit (small SSR) and parsimony (small \(p\)) are

  • the Akaike Information Criterion (AIC)
  • the Bayesian Information Criterion (BIC)

The formulas for AIC and BIC are respectively \[ AIC = 2p - \text{GoF}, \qquad BIC = \ln(n)\,p - \text{GoF} \] where \(\text{GoF}\) is a measure of the goodness of fit of the model.

Unlike the adjusted-\(R^2\), smaller is better for the AIC and BIC.

Note that the BIC penalizes the number of parameters \(p\) more strongly.
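This difference in penalties is easy to see numerically. The sketch below re-fits m1 and uses the built-in AIC() and BIC() functions; per parameter, AIC charges 2 while BIC charges \(\ln(n) \approx 3.4\) for the \(n = 31\) trees:

```r
# Sketch: BIC's per-parameter penalty ln(n) exceeds AIC's penalty of 2
# whenever n > e^2 (about 7.4), so BIC is the harsher criterion here
m1 <- lm(Volume ~ Girth, data = datasets::trees)

c(AIC = AIC(m1), BIC = BIC(m1))  # same values glance() reports
log(nrow(datasets::trees))       # per-parameter BIC penalty, about 3.43 > 2
```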

Computing AIC and BIC

Easily accessible in R with the glance command from the broom package. For instance,

glance(m1)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.935         0.933  4.25      419. 8.64e-19     1  -87.8  182.  186.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(m2)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.962         0.959  3.33      350. 1.52e-20     2  -79.7  167.  173.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(m_extreme)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.975         0.956  3.44      51.4 5.11e-11    13  -73.0  176.  197.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

In this case, all three criteria (adjusted-\(R^2\), AIC and BIC) indicate that m2 is the best model.

  • For AIC and BIC, smaller is better.

Model selection: predictive performance

Limitations of the previous methods

The adjusted-\(R^2\), AIC and BIC all try to balance

  1. goodness of fit
  2. parsimony

They achieve this balance by favoring models with small SSR while penalizing models with larger \(p\).

The form of the penalty for \(p\) may seem somewhat arbitrary.

  • e.g. AIC versus BIC

Predictive performance

Instead, we could look for the model with the best predictive performance; that is, the model whose predictions for new observations are closest to the true values.

The holdout method

The holdout method is a simple method to evaluate the predictive performance of a model.

  1. Randomly partition your sample into two sets: a training set (typically 2/3 of the sample) and a test set.
  2. Fit your model to the training set.
  3. Evaluate the prediction accuracy on the test set.

Note that the test set consists of new observations for the model.

A good model will have good prediction accuracy in step 3.

Step 1: training and test sets

The following R function splits a sample into a training and a test set

construct_training_test <- function(sample, prop = 2/3){
  
  n          <- nrow(sample)
  n_training <- round(n*prop)
  
  # shuffle the rows before splitting
  sample_random   <- slice_sample(sample, n = n)
  sample_training <- slice(sample_random,   1:n_training )
  sample_test     <- slice(sample_random, -(1:n_training))
  
  return(list(
    training = sample_training, test = sample_test
    ))
  
}

d_car <- ggplot2::mpg
training_test_sets <- construct_training_test(d_car) 
training_set <- training_test_sets[["training"]]
training_set
# A tibble: 156 x 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
 2 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
 3 honda        civic        1.6  1999     4 auto~ f        24    32 r     subc~
 4 toyota       camry sol~   2.2  1999     4 auto~ f        21    27 r     comp~
 5 toyota       camry sol~   2.4  2008     4 manu~ f        21    31 r     comp~
 6 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
 7 audi         a6 quattro   3.1  2008     6 auto~ 4        17    25 p     mids~
 8 lincoln      navigator~   5.4  1999     8 auto~ r        11    16 p     suv  
 9 land rover   range rov~   4.4  2008     8 auto~ 4        12    18 r     suv  
10 toyota       land crui~   5.7  2008     8 auto~ 4        13    18 r     suv  
# ... with 146 more rows
test_set     <- training_test_sets[["test"]]
test_set
# A tibble: 78 x 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 volkswagen   jetta        2.8  1999     6 manu~ f        17    24 r     comp~
 2 ford         explorer ~   4    1999     6 manu~ 4        15    19 r     suv  
 3 nissan       altima       2.5  2008     4 auto~ f        23    31 r     mids~
 4 chevrolet    malibu       3.1  1999     6 auto~ f        18    26 r     mids~
 5 chevrolet    malibu       2.4  2008     4 auto~ f        22    30 r     mids~
 6 dodge        caravan 2~   3.3  2008     6 auto~ f        11    17 e     mini~
 7 volkswagen   jetta        2    2008     4 manu~ f        21    29 p     comp~
 8 volkswagen   gti          2    2008     4 auto~ f        22    29 p     comp~
 9 toyota       corolla      1.8  2008     4 manu~ f        28    37 r     comp~
10 dodge        caravan 2~   3.3  2008     6 auto~ f        17    24 r     mini~
# ... with 68 more rows

Step 2: fit the model to the training set

We now fit our regression model to the training set, as we have done many times.

m <- lm(hwy ~ cty, data = training_set)

Step 3: Evaluate the prediction accuracy on the test set

To evaluate the prediction accuracy, we start by computing the predictions for the test set.

hwy_hat <- predict(m, test_set)

A good model will make predictions that are close to the true values of the response variable.

A common measure of prediction accuracy is the root mean squared error (RMSE):

\[ RMSE = \sqrt{\dfrac{SSE}{m}} = \sqrt{\dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2}{m}} \]

where \(m\) corresponds to the size of the test set.
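The RMSE formula translates directly into a one-line helper; compute_rmse is a hypothetical name, not part of any package:

```r
# Sketch: RMSE = sqrt of the mean squared prediction error
compute_rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))

# tiny worked example: errors 0, 0 and 2, so RMSE = sqrt(4/3)
compute_rmse(c(1, 2, 3), c(1, 2, 5))  # about 1.15
```

For the test set of the previous step, this would be called as compute_rmse(test_set$hwy, hwy_hat).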

Selecting a model

Apply steps 2 and 3 to different models.

  • use the same training and test sets for the different models

Simply choose the model with the lowest RMSE.

This model has the best out-of-sample accuracy.

Cross-validation

Cross-validation (CV) is the same as the holdout method, but repeated many times.

A drawback of the holdout method is that the result depends heavily on which observations end up in the test set.

Repeating steps 2 and 3 with a different partition from step 1 may give different results.
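A quick sketch of this sensitivity on the trees data: repeating the holdout split five times gives five noticeably different RMSEs.

```r
# Sketch: the holdout RMSE varies with the random training/test split
d_tree <- datasets::trees

set.seed(1)
rmses <- replicate(5, {
  idx  <- sample(nrow(d_tree), size = 21)            # ~2/3 for training
  fit  <- lm(Volume ~ Girth, data = d_tree[idx, ])   # fit on training rows
  pred <- predict(fit, newdata = d_tree[-idx, ])     # predict held-out rows
  sqrt(mean((d_tree$Volume[-idx] - pred)^2))
})
round(rmses, 2)  # five different splits, five different RMSEs
```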

Idea of CV: let each observation be in the test set exactly once.

\(k\)-fold CV

Choose a number of folds \(k\) (typically \(5\) or \(10\)).

  1. Partition the data into \(k\) folds of equal size.
  2. Let the test set be composed of fold \(1\) and the training set of the other folds.
  3. Apply steps 2 and 3 of the holdout method.
  4. Go back to step 2, this time letting the next fold be the test set.

set.seed(345)

n_folds <- 10

# county_2019_nc: county-level data for North Carolina, assumed prepared earlier
county_2019_nc_folds <- county_2019_nc %>%
  slice_sample(n = nrow(county_2019_nc)) %>%
  mutate(fold = rep(1:n_folds, n_folds)) %>%
  arrange(fold)

predict_folds <- function(i) {
  fit <- lm(uninsured ~ hs_grad, data = county_2019_nc_folds %>% filter(fold != i))
  predict(fit, newdata = county_2019_nc_folds %>% filter(fold == i)) %>%
    bind_cols(county_2019_nc_folds %>% filter(fold == i), .fitted = .)
}

nc_fits <- map_df(1:n_folds, predict_folds)

p_nc_fits <- ggplot(nc_fits, aes(x = hs_grad, y = .fitted, group = fold)) +
  geom_line(stat = "smooth", method = "lm", se = FALSE, size = 0.3, alpha = 0.5) +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  labs(
    x = "High school graduate", y = "Uninsured",
    title = "Predicted uninsurance rate in NC",
    subtitle = glue("For {n_folds} different testing datasets")
    )

p_nc_fits
set.seed(123)

n_folds <- 10

county_2019_ny_folds <- county_2019_ny %>%
  slice_sample(n = nrow(county_2019_ny)) %>%
  mutate(fold = c(rep(1:n_folds, 6), 1, 2)) %>%
  arrange(fold)

predict_folds <- function(i) {
  fit <- lm(uninsured ~ hs_grad, data = county_2019_ny_folds %>% filter(fold != i))
  predict(fit, newdata = county_2019_ny_folds %>% filter(fold == i)) %>%
    bind_cols(county_2019_ny_folds %>% filter(fold == i), .fitted = .)
}

ny_fits <- map_df(1:n_folds, predict_folds)

p_ny_fits <- ggplot(ny_fits, aes(x = hs_grad, y = .fitted, group = fold)) +
  geom_line(stat = "smooth", method = "lm", se = FALSE, size = 0.3, alpha = 0.5) +
  scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
  labs(
    x = "High school graduate", y = "Uninsured",
    title = "Predicted uninsurance rate in NY",
    subtitle = glue("For {n_folds} different testing datasets")
    )

p_ny_fits

Taking things to the extreme: LOOCV

Set \(k=n\); that is, we use \(n\) folds, each of size \(1\). The test sets will therefore consist of a single observation and the training sets of \(n-1\) observations.
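A minimal LOOCV sketch on the trees data, comparing the simple and quadratic models from earlier (loocv_rmse is a hypothetical helper name):

```r
# Sketch: leave-one-out CV on the trees data; each tree is the test set once
d_tree <- datasets::trees

loocv_rmse <- function(formula) {
  errors <- sapply(seq_len(nrow(d_tree)), function(i) {
    fit <- lm(formula, data = d_tree[-i, ])          # train without row i
    d_tree$Volume[i] - predict(fit, newdata = d_tree[i, ])
  })
  sqrt(mean(errors^2))
}

rmse_lin  <- loocv_rmse(Volume ~ Girth)              # simple model
rmse_quad <- loocv_rmse(Volume ~ Girth + I(Girth^2)) # quadratic model
c(linear = rmse_lin, quadratic = rmse_quad)
```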

Model selection: stepwise procedures

  • Not my favorite method,

  • but this is something you should learn because it is widely used.

  • Akin to the “throw cooked spaghetti at the wall and see what sticks” technique.

Stepwise selection procedures are of two types: (i) forward and (ii) backward.

Forward selection procedure

  1. Choose a criterion that balances model fit (small SSR) and parsimony (small \(p\)), e.g. the adjusted-\(R^2\), AIC or BIC.
  2. Start with the empty model \(Y \approx \beta_0\), i.e. the model with no predictor. This is our current model.
  3. Fit all possible models with one additional predictor.
  4. Compute the AIC of each of these models.
  5. Identify the model with the smallest AIC. This is our candidate model.
  6. If the AIC of the candidate model is smaller (better) than the AIC of the current model, the candidate model becomes the current model, and we go back to step 3.
  7. If the AIC of the candidate model is larger than the AIC of the current model (no new model improves on the current one), the procedure stops, and we select the current model.
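In practice, forward selection does not have to be coded by hand: base R's step() function implements it. A sketch on the mpg data, with a model scope chosen here purely for illustration:

```r
# Sketch: forward selection with AIC via the built-in step() function,
# starting from the empty model (scope formula chosen for illustration)
d_car   <- ggplot2::mpg
m_empty <- lm(hwy ~ 1, data = d_car)

m_fwd <- step(m_empty,
              scope     = ~ displ + cyl + cty,  # candidate predictors
              direction = "forward",
              trace     = FALSE)                # suppress the step-by-step log
formula(m_fwd)  # the selected model
```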

Backward selection procedure

Similar to forward selection, except that

  • we start with the full model,
  • remove one predictor at a time
  • until removing any predictor makes the model worse.
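The same step() function handles backward selection; a sketch starting from an illustrative full model on the mpg data:

```r
# Sketch: backward selection with AIC, starting from a full model and
# letting step() drop one predictor at a time (full model chosen for
# illustration)
d_car  <- ggplot2::mpg
m_full <- lm(hwy ~ displ + cyl + cty, data = d_car)

m_bwd <- step(m_full, direction = "backward", trace = FALSE)
formula(m_bwd)  # the selected model
```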

Note that forward and backward selection need not agree; they may select different models.

  • What to do in that case? Nobody knows.

Group exercise - stepwise selection

  • Exercises 8.11, 8.13

  • In addition

  1. fit the first models in R
  2. identify the baseline level of the categorical variable in each model.
library(openintro)
d <- openintro::births14

How to prevent overfitting?

Parsimony

Use domain knowledge

One-in-ten rule (could be one-in-five)
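A back-of-the-envelope sketch of the rule: with the \(n = 31\) trees, the one-in-ten rule suggests at most about three predictors.

```r
# Sketch of the one-in-ten rule of thumb: with n observations, allow
# roughly n / 10 predictors (n / 5 for the more lenient one-in-five rule)
n <- nrow(datasets::trees)  # 31 trees
floor(n / 10)               # at most ~3 predictors under one-in-ten
floor(n / 5)                # at most ~6 predictors under one-in-five
```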

Multicollinearity